arXiv:1506.04917v2 [cs.DS] 22 Dec 2015 


Linear-Time Sequence Comparison Using 
Minimal Absent Words & Applications


Maxime Crochemore¹, Gabriele Fici², Robert Mercaş¹,³, and Solon P. Pissis¹

¹ Department of Informatics, King's College London, UK
² Dipartimento di Matematica e Informatica, Università di Palermo, Italy
³ Department of Computer Science, Kiel University, Germany
maxime.crochemore@kcl.ac.uk, gabriele.fici@unipa.it,
rgm@informatik.uni-kiel.de, solon.pissis@kcl.ac.uk


Abstract. Sequence comparison is a prerequisite to virtually all comparative
genomic analyses. It is often realized by sequence alignment techniques, which
are computationally expensive. This has led to increased research into
alignment-free techniques, which are based on measures referring to the
composition of sequences in terms of their constituent patterns. These
measures, such as q-gram distance, are usually computed in time linear with
respect to the length of the sequences. In this article, we focus on the
complementary idea: how two sequences can be efficiently compared based on
information that does not occur in the sequences. A word is an absent word of
some sequence if it does not occur in the sequence. An absent word is minimal
if all its proper factors occur in the sequence. Here we present the first
linear-time and linear-space algorithm to compare two sequences by considering
all their minimal absent words. In the process, we present results of
combinatorial interest, and also extend the proposed techniques to compare
circular sequences.

Keywords: algorithms on strings, sequence comparison, alignment-free 
comparison, absent words, forbidden words, circular words. 


1 Introduction 

Sequence comparison is an important step in many basic tasks in bioinformatics, 
from phylogeny reconstruction to genome assembly. It is often realized by
sequence alignment techniques, which are computationally expensive, requiring 
quadratic time in the length of the sequences. This has led to increased research 
into alignment-free techniques. Hence standard notions for sequence comparison 
are gradually being complemented and in some cases replaced by alternative 
ones [ini. One such notion is based on comparing the words that are absent 
in each sequence [1]. A word is an absent word (or a forbidden word) of some
sequence if it does not occur in the sequence. Absent words represent a type of 
negative information: information about what does not occur in the sequence. 

Given a sequence of length n, the number of absent words of length at most n 
is exponential in n. However, the number of certain classes of absent words is only 
linear in n. This is the case for minimal absent words, that is, absent words in the
sequence all of whose proper factors occur in the sequence [5]. An upper bound on
the number of minimal absent words is known to be O(σn) [9, 23], where σ is the
size of the alphabet Σ. Hence it may be possible to compare sequences in time
proportional to their lengths, for a fixed-sized alphabet, instead of proportional
to the product of their lengths. In what follows, we consider sequences on a
fixed-sized alphabet since the most commonly studied alphabet is Σ = {A, C, G, T}.

An O(n)-time and O(n)-space algorithm for computing all minimal absent
words on a fixed-sized alphabet based on the construction of suffix automata
was presented in [9]. The computation of minimal absent words based on the
construction of suffix arrays was considered in [55]; although this algorithm has
a linear-time performance in practice, the worst-case time complexity is O(n²).
New O(n)-time and O(n)-space suffix-array-based algorithms were presented
(see, e.g., [2]) to bridge this unpleasant gap. An implementation of the algorithm
presented in [2] is currently, and to the best of our knowledge, the fastest available
for the computation of minimal absent words. A more space-efficient solution to
compute all minimal absent words in time 0{n) was also presented in |^. 

In this article, we consider the problem of comparing two sequences x and y 
of respective lengths m and n, using their sets of minimal absent words. In [7], 
Chairungsee and Crochemore introduced a measure of similarity between two 
sequences based on the notion of minimal absent words. They made use of a 
length-weighted index to provide a measure of similarity between two sequences, 
using sample sets of their minimal absent words, by considering the length of 
each member in the symmetric difference of these sample sets. This measure can 
be trivially computed in time and space O(m + n) provided that these sample
sets contain minimal absent words of some bounded length ℓ. For unbounded
length, the same measure can be trivially computed in time O(m² + n²): for a
given sequence, the cumulative length of all its minimal absent words can grow 
quadratically with respect to the length of the sequence. 

The same problem can be considered for two circular sequences. The measure 
of similarity of Chairungsee and Crochemore can be used in this setting provided 
that one extends the definition of minimal absent words to circular sequences. 
In Section 4 we give a definition of minimal absent words for a circular sequence
from the Formal Language Theory point of view. We believe that this definition
may also be of interest from the point of view of Symbolic Dynamics, which is
the original context in which minimal absent words have been introduced [5].

Our Contribution. Here we make the following threefold contribution: 

a) We present an O(m + n)-time and O(m + n)-space algorithm to compute the
similarity measure introduced by Chairungsee and Crochemore by considering
all minimal absent words of two sequences x and y of lengths m and
n, respectively; thereby showing that it is indeed possible to compare two
sequences in time proportional to their lengths (Section 3).

b) We show how this algorithm can be applied to compute this similarity measure
for two circular sequences x and y of lengths m and n, respectively,
in the same time and space complexity as a result of the extension of the
definition of minimal absent words to circular sequences (Section 4).

c) We provide an open-source code implementation of our algorithms and
investigate potential applications of our theoretical findings (Section 5).


2 Definitions and Notation 


We begin with basic definitions and notation. Let y = y[0]y[1]..y[n - 1] be a
word of length n = |y| over a finite ordered alphabet Σ of size σ = |Σ| = O(1).
For two positions i and j on y, we denote by y[i..j] = y[i]..y[j] the factor
(sometimes called substring) of y that starts at position i and ends at position j
(it is empty if j < i), and by ε the empty word, the word of length 0. We recall that
a prefix of y is a factor that starts at position 0 (y[0..j]) and a suffix is a factor
that ends at position n - 1 (y[i..n - 1]), and that a factor of y is a proper factor
if it is not y itself. The set of all the factors of the word y is denoted by F_y.

Let x be a word of length 0 < m ≤ n. We say that there exists an occurrence
of x in y, or, more simply, that x occurs in y, when x is a factor of y. Every
occurrence of x can be characterised by a starting position in y. Thus we say that
x occurs at the starting position i in y when x = y[i..i + m - 1]. Conversely,
we say that the word x is an absent word of y if it does not occur in y. The
absent word x of y is minimal if and only if all its proper factors occur in y. The
set of all minimal absent words for a word y is denoted by M_y. For example, if
y = abaab, then M_y = {aaa, aaba, bab, bb}. In general, if we suppose that all the
letters of the alphabet appear in y of length n, the length of a minimal absent
word of y lies between 2 and n + 1. It is equal to n + 1 if and only if y is the
catenation of n copies of the same letter. So, if y contains occurrences of at least
two different letters, the length of a minimal absent word for y is bounded from
above by n.
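The definition can be checked directly on small inputs. The following is a minimal brute-force sketch, not the linear-time algorithm discussed in this article; the function names `factors` and `minimal_absent_words` are hypothetical, and the alphabet is assumed to consist of the letters occurring in y unless one is passed explicitly. It relies on the fact that all proper factors of a word w occur in y as soon as its two maximal proper factors w[1..] and w[..|w| - 2] do.

```python
def factors(y):
    """Return the set of all factors (substrings) of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def minimal_absent_words(y, alphabet=None):
    """Brute-force M_y: absent words whose two maximal proper factors occur in y."""
    if alphabet is None:
        alphabet = sorted(set(y))  # assumption: every letter of the alphabet occurs in y
    fac = factors(y)
    maws = set()
    for u in fac:                  # u plays the role of w[1:], which must occur in y
        for a in alphabet:
            w = a + u
            if w not in fac and w[:-1] in fac:
                maws.add(w)
    return maws


print(sorted(minimal_absent_words("abaab")))  # ['aaa', 'aaba', 'bab', 'bb']
print(sorted(minimal_absent_words("aaaa")))   # ['aaaaa'], a word of length n + 1
```

This sketch enumerates a quadratic number of factors, so it is only meant for checking examples such as y = abaab by hand.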

A language over the alphabet Σ is a set of finite words over Σ. A language is
regular if it is recognized by a finite state automaton. A language is factorial if it
contains all the factors of its words. A language is antifactorial if no word in the
language is a proper factor of another word in the language. Given a word x, the
language generated by x is the language x* = {x^k | k ≥ 0} = {ε, x, xx, xxx, ...}.
The factorial closure of a language L is the language F_L = {F_y | y ∈ L}. Given
a factorial language L, one can define the (antifactorial) language of minimal
absent words for L as M_L = {aub | aub ∉ L, au, ub ∈ L}. Notice that M_L is not
the same language as the union of M_x for x ∈ L.

We denote by SA the suffix array of y of length n, that is, an integer array
of size n storing the starting positions of all (lexicographically) sorted suffixes of
y, i.e. for all 1 ≤ r < n we have y[SA[r - 1]..n - 1] < y[SA[r]..n - 1] [22]. Let
lcp(r, s) denote the length of the longest common prefix between y[SA[r]..n - 1]
and y[SA[s]..n - 1] for all positions r, s on y, and 0 otherwise. We denote by
LCP the longest common prefix array of y defined by LCP[r] = lcp(r - 1, r) for
all 1 ≤ r < n, and LCP[0] = 0. The inverse iSA of the array SA is defined by
iSA[SA[r]] = r, for all 0 ≤ r < n. It is known that SA [25], iSA, and LCP [12] of
a word of length n can be computed in time and space O(n).

In what follows, as already proposed in [2], for every word y, the set of
minimal absent words associated with y, denoted by M_y, is represented as a set of
tuples ⟨a, i, j⟩, where the corresponding minimal absent word x of y is defined
by x[0] = a, a ∈ Σ, and x[1..m - 1] = y[i..j], where j - i + 1 = m ≥ 2. It is
known that if |y| = n and |Σ| = σ, then |M_y| ≤ σn [23].
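To make the three arrays defined above concrete, here is a small illustrative sketch. It uses naive constructions (sorting suffixes directly and comparing adjacent suffixes letter by letter), not the linear-time algorithms cited in the text, and the function names are hypothetical.

```python
def suffix_array(y):
    """SA: starting positions of the suffixes of y in lexicographic order (naive)."""
    return sorted(range(len(y)), key=lambda i: y[i:])


def inverse_suffix_array(sa):
    """iSA: iSA[SA[r]] = r for every rank r."""
    isa = [0] * len(sa)
    for r, pos in enumerate(sa):
        isa[pos] = r
    return isa


def lcp_array(y, sa):
    """LCP[r] = longest common prefix of the suffixes of rank r - 1 and r; LCP[0] = 0."""
    n = len(y)
    lcp = [0] * n
    for r in range(1, n):
        i, j, k = sa[r - 1], sa[r], 0
        while i + k < n and j + k < n and y[i + k] == y[j + k]:
            k += 1
        lcp[r] = k
    return lcp


y = "abaab"
sa = suffix_array(y)
print(sa)                        # [2, 3, 0, 4, 1]
print(inverse_suffix_array(sa))  # [2, 4, 0, 1, 3]
print(lcp_array(y, sa))          # [0, 1, 2, 0, 1]
```

For y = abaab the sorted suffixes are aab, ab, abaab, b, baab, which gives the arrays shown in the comments.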

In [7], Chairungsee and Crochemore introduced a measure of similarity between
two words x and y based on the notion of minimal absent words. Let M_x^ℓ
(resp. M_y^ℓ) denote the set of minimal absent words of length at most ℓ of x
(resp. y). The authors made use of a length-weighted index to provide a measure
of the similarity between x and y, using their sample sets M_x^ℓ and M_y^ℓ, by
considering the length of each member in the symmetric difference (M_x^ℓ △ M_y^ℓ)
of the sample sets. For sample sets M_x^ℓ and M_y^ℓ, they defined this index to be

\mathrm{LW}(M_x^{\ell}, M_y^{\ell}) = \sum_{w \in M_x^{\ell} \triangle M_y^{\ell}} \frac{1}{|w|^{2}}.

This work considers the following generalized version of the same problem.

MAW-SequenceComparison
Input: a word x of length m and a word y of length n
Output: LW(M_x, M_y), where M_x and M_y denote the sets of minimal absent
words of x and y, respectively.
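As a small worked instance of the problem, the sketch below evaluates a length-weighted index on two explicit sets of minimal absent words. It assumes that each word w in the symmetric difference contributes the weight 1/|w|^2; the function name `lw_index` is hypothetical, and exact rational arithmetic is used only to keep the example free of rounding. The set for abaab is the example from the text; the set for abab was derived by hand for this illustration.

```python
from fractions import Fraction


def lw_index(maws_x, maws_y):
    """Sum of 1/|w|^2 over the symmetric difference (weight is an assumption here)."""
    return sum((Fraction(1, len(w) ** 2) for w in maws_x ^ maws_y), Fraction(0))


m_x = {"aaa", "aaba", "bab", "bb"}  # minimal absent words of x = abaab
m_y = {"aa", "bb", "baba"}          # minimal absent words of y = abab (hand-derived)

# Symmetric difference: aaa, aaba, bab, aa, baba; bb is common and contributes nothing.
print(lw_index(m_x, m_y))  # 43/72
print(lw_index(m_x, m_x))  # 0
```

The value is 1/9 + 1/16 + 1/9 + 1/4 + 1/16 = 43/72, and identical sets give 0.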


We also consider the aforementioned problem for two circular words. A circular
word of length m can be viewed as a traditional linear word which has
the left- and right-most letters wrapped around and stuck together in some way.
Under this notion, the same circular word can be seen as m different linear
words, which would all be considered equivalent. More formally, given a word
x of length m, we denote by x^⟨i⟩ = x[i..m - 1]x[0..i - 1], 0 ≤ i < m, the
i-th rotation of x, where x^⟨0⟩ = x. Given two words x and y, we define x ∼ y
if and only if there exists i, 0 ≤ i < |x|, such that y = x^⟨i⟩. A circular word x̃
is a conjugacy class of the equivalence relation ∼. Given a circular word x̃, any
(linear) word x in the equivalence class x̃ is called a linearization of the circular
word x̃. Conversely, given a linear word x, we say that x̃ is a circularization of x
if and only if x is a linearization of x̃. The set F_x̃ of factors of the circular word
x̃ is equal to the set F_xx ∩ Σ^{≤|x|} of factors of xx whose length is at most |x|,
where x is any linearization of x̃.

Note that if x^⟨i⟩ and x^⟨j⟩ are two rotations of the same word, then the factorial
languages F_{(x^⟨i⟩)*} and F_{(x^⟨j⟩)*} coincide, so one can unambiguously define the
(infinite) language F_{x̃*} as the language F_{x*}, where x is any linearization of x̃.

In Section 4 we give the definition of the set M_x̃ of minimal absent words
for a circular word x̃. We will prove that the following problem can be solved
with the same time and space complexity as its counterpart in the linear case.

MAW-CircularSequenceComparison
Input: a word x of length m and a word y of length n
Output: LW(M_x̃, M_ỹ), where M_x̃ and M_ỹ denote the sets of minimal absent
words of the circularizations x̃ of x and ỹ of y, respectively.


3 Sequence Comparison 

The goal of this section is to provide the first linear-time and linear-space
algorithm for computing the similarity measure (see Section 2) between two words
defined over a fixed-sized alphabet. To this end, we consider two words x and
y of lengths m and n, respectively, and their associated sets of minimal absent
words, M_x and M_y, respectively. Next, we give a linear-time and linear-space
solution for the MAW-SequenceComparison problem. It is known from [9]
and [2] that we can compute the sets M_x and M_y in linear time and space
with respect to the two lengths m and n, respectively. The idea of our strategy
consists of a merge sort on the sets M_x and M_y, after they have been ordered
with the help of suffix arrays.

To this end, we construct the suffix array associated to the word w = xy,
together with the implicit LCP array corresponding to it. All of these structures can
be constructed in time and space O(m + n), as mentioned earlier. Furthermore,
we can preprocess the array LCP for range minimum queries, which we denote by
RMQ_LCP [13]. With the preprocessing complete, the longest common prefix LCE
of two suffixes of w starting at positions p and q, where we assume without loss
of generality that iSA[p] < iSA[q], can be computed in constant
time [12], using the formula LCE(w, p, q) = LCP[RMQ_LCP(iSA[p] + 1, iSA[q])].
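The LCE formula above can be illustrated as follows. This is a sketch under stated simplifications: the suffix and LCP arrays are built naively, and the range minimum structure is a sparse table (constant-time queries after O(n log n) preprocessing) rather than the linear-preprocessing structure cited in the text. All function names are hypothetical.

```python
def build_arrays(w):
    """Naive SA, iSA and LCP arrays of w (for illustration only)."""
    n = len(w)
    sa = sorted(range(n), key=lambda i: w[i:])
    isa = [0] * n
    for r, pos in enumerate(sa):
        isa[pos] = r
    lcp = [0] * n
    for r in range(1, n):
        i, j, k = sa[r - 1], sa[r], 0
        while i + k < n and j + k < n and w[i + k] == w[j + k]:
            k += 1
        lcp[r] = k
    return sa, isa, lcp


def build_rmq(values):
    """Sparse table: table[k][i] is the index of a minimum of values[i : i + 2**k]."""
    n = len(values)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if values[a] <= values[b] else b)
        table.append(row)
        k += 1
    return table


def rmq(values, table, lo, hi):
    """Index of a minimum of values[lo..hi] (both inclusive), assuming lo <= hi."""
    k = (hi - lo + 1).bit_length() - 1
    a, b = table[k][lo], table[k][hi - (1 << k) + 1]
    return a if values[a] <= values[b] else b


def lce(w, isa, lcp, table, p, q):
    """Length of the longest common prefix of the suffixes w[p:] and w[q:]."""
    if p == q:
        return len(w) - p
    r, s = sorted((isa[p], isa[q]))        # order the two ranks so that r < s
    return lcp[rmq(lcp, table, r + 1, s)]  # LCP[RMQ(iSA[p] + 1, iSA[q])]


w = "abaab"
sa, isa, lcp = build_arrays(w)
table = build_rmq(lcp)
print(lce(w, isa, lcp, table, 0, 3))  # 2: suffixes abaab and ab share the prefix ab
```

Sorting the two ranks inside `lce` removes the need for the caller to know which suffix is lexicographically smaller.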

Using these data structures, it is straightforward to sort the tuples in the
sets M_x and M_y lexicographically. That is, two tuples x_1, x_2 ∈ M_x are ordered
according to the letter following their longest common prefix, or, when one is a
prefix of the other, with the prefix coming first. To do this, we simply go
once through the suffix array associated to w and assign to each tuple in M_x,
respectively M_y, the rank of the suffix starting at the position indicated by
its second component, in the suffix array. Since sorting an array of n distinct
integers, such that each is in [0, n - 1], can be done in time O(n) (using bucket
sort, for example), we can now sort each of the sets of minimal absent words,
taking into consideration the letter in the first position and these ranks. Thus,
from now on, we assume that M_x = (x_0, x_1, ..., x_k), where x_i is lexicographically
smaller than x_{i+1}, for 0 ≤ i < k ≤ σm, and M_y = (y_0, y_1, ..., y_ℓ), where y_j is
lexicographically smaller than y_{j+1}, for 0 ≤ j < ℓ ≤ σn.
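The ordering step can be sketched for a single word as follows. The sketch makes several assumptions that are not in the text: tuples are written (a, i, j) and stand for the word a·y[i..j] with j inclusive, they are produced by brute force, the suffix array is built naively, and Python's built-in sort replaces the bucket sort. Sorting by the pair (first letter, rank of the suffix starting at i) yields lexicographic order because no minimal absent word is a proper factor of another.

```python
def maw_tuples(y):
    """Brute force: one tuple (a, i, j) per minimal absent word a + y[i:j+1] of y."""
    n = len(y)
    alphabet = sorted(set(y))
    fac = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    seen, tuples = set(), []
    for i in range(n):
        for j in range(i, n):
            for a in alphabet:
                w = a + y[i:j + 1]
                if w not in fac and w[:-1] in fac and w not in seen:
                    seen.add(w)
                    tuples.append((a, i, j))
    return tuples


def sort_tuples(y, tuples):
    """Order tuples by first letter, then by the suffix-array rank of position i."""
    sa = sorted(range(len(y)), key=lambda p: y[p:])  # naive suffix array
    rank = {pos: r for r, pos in enumerate(sa)}
    return sorted(tuples, key=lambda t: (t[0], rank[t[1]]))


def spell(y, t):
    """Recover the minimal absent word encoded by the tuple t."""
    a, i, j = t
    return a + y[i:j + 1]


y = "abaab"
ordered = sort_tuples(y, maw_tuples(y))
print([spell(y, t) for t in ordered])  # ['aaa', 'aaba', 'bab', 'bb']
```

The words never need to be materialised for sorting; `spell` is used here only to display the result.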

Provided these tools, we now proceed to do the merge. Thus, considering
that we are analysing the (i + 1)th tuple in M_x and the (j + 1)th tuple in M_y,
we note that the two are equal if and only if x_i[0] = y_j[0] and

LCE(w, x_i[1], |x| + y_j[1]) ≥ ℓ, where ℓ = x_i[2] - x_i[1] = y_j[2] - y_j[1].

In other words, the two minimal absent words are equal if and only if their first
letters coincide, they have equal length ℓ + 1, and the longest common prefix of
the suffixes of w starting at the positions indicated by the second components
of the tuples has length at least ℓ.

Such a strategy will empower us with the means for constructing a new set
M_xy = M_x ∪ M_y. At each step, when analysing tuples x_i and y_j, we proceed as
follows:

M_{xy} =
\begin{cases}
M_{xy} \cup \{x_i\}, \text{ and increment } i, & \text{if } x_i < y_j; \\
M_{xy} \cup \{y_j\}, \text{ and increment } j, & \text{if } x_i > y_j; \\
M_{xy} \cup \{x_i = y_j\}, \text{ and increment both } i \text{ and } j, & \text{if } x_i = y_j.
\end{cases}

Observe that the last condition is saying that basically each common tuple is
added only once to their union.

Furthermore, simultaneously with this construction we can also calculate the
similarity between the words, given by LW(M_x, M_y), which is initially set to 0.
Thus, at each step, when comparing the tuples x_i and y_j, we update

\mathrm{LW}(M_x, M_y) =
\begin{cases}
\mathrm{LW}(M_x, M_y) + \frac{1}{|x_i|^{2}}, \text{ and increment } i, & \text{if } x_i < y_j; \\
\mathrm{LW}(M_x, M_y) + \frac{1}{|y_j|^{2}}, \text{ and increment } j, & \text{if } x_i > y_j; \\
\mathrm{LW}(M_x, M_y), \text{ and increment both } i \text{ and } j, & \text{if } x_i = y_j.
\end{cases}

We impose the increment of both i and j in the case of equality as in this case we
only look at the symmetric difference between the sets of minimal absent words.
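The merge described above can be sketched as a standard two-pointer pass over two sorted lists. To stay short, this sketch works on the minimal absent words spelled out as strings and compares them directly, whereas the text compares tuples in constant time via LCE queries; the weight 1/|w|^2 per unmatched word is an assumption of the sketch, and the function name `lw_by_merge` is hypothetical. Once one list is exhausted, every remaining word of the other list belongs to the symmetric difference.

```python
from fractions import Fraction


def lw_by_merge(sorted_x, sorted_y):
    """Two-pointer merge over lexicographically sorted lists of minimal absent words."""
    i = j = 0
    lw = Fraction(0)
    while i < len(sorted_x) and j < len(sorted_y):
        xi, yj = sorted_x[i], sorted_y[j]
        if xi < yj:                           # xi cannot occur later in sorted_y
            lw += Fraction(1, len(xi) ** 2)
            i += 1
        elif xi > yj:                         # yj cannot occur later in sorted_x
            lw += Fraction(1, len(yj) ** 2)
            j += 1
        else:                                 # common word: not in the symmetric difference
            i += 1
            j += 1
    for w in sorted_x[i:] + sorted_y[j:]:     # leftovers of the longer list
        lw += Fraction(1, len(w) ** 2)
    return lw


m_x = ["aaa", "aaba", "bab", "bb"]  # sorted minimal absent words of abaab
m_y = ["aa", "baba", "bb"]          # sorted minimal absent words of abab (hand-derived)
print(lw_by_merge(m_x, m_y))        # 43/72
```

Each word is visited once, so the pass performs a number of comparisons linear in the total number of minimal absent words.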

As all these operations take constant time, once per each tuple in M_x and
M_y, it is easily concluded that the whole operation takes, in the case of a
fixed-sized alphabet, time and space O(m + n). Thus, we can compute the symmetric
difference between the complete sets of minimal absent words, as opposed to [7],
of two words defined over a fixed-sized alphabet, in linear time and space with
respect to the lengths of the two words. We obtain the following result.

Theorem 1. Problem MAW-SequenceComparison can be solved in time
and space O(m + n).

4 Circular Sequence Comparison 

Next, we discuss two possible definitions for the minimal absent words of a 
circular word, and highlight the differences between them. 

We start by recalling some basic facts about minimal absent words. For 
further details and references the reader is recommended m- Every factorial 
language L is uniquely determined by its (antifactorial) language of minimal
absent words M_L, through the equation L = Σ* \ Σ* M_L Σ*. The converse
is also true, since by the definition of a minimal absent word we have M_L =
ΣL ∩ LΣ ∩ (Σ* \ L). The previous equations define a bijection between factorial
and antifactorial languages. Moreover, this bijection preserves regularity. In the
case of a single (linear) word x, the set of minimal absent words for x is indeed
the antifactorial language M_{F_x}. Furthermore, we can retrieve x from its set of
minimal absent words in linear time and space [9].

Recall that given a circular word x̃, the set F_x̃ of factors of x̃ is equal to the
set F_xx ∩ Σ^{≤|x|} of factors of xx whose lengths are at most |x|, where x is any
linearization of x̃. Since a circular word x̃ is a conjugacy class containing all the
rotations of a linear word x, the language F_x̃ can be seen as the factorial closure
of the set {x^⟨i⟩ | i = 0, ..., |x| - 1}. This leads to the first definition of the set
of minimal absent words for x̃, that is the set M_{F_x̃} = {aub | a, b ∈ Σ, aub ∉
F_x̃, au, ub ∈ F_x̃}. For instance, if x = abaab, we have

M_{F_x̃} = {aaa, aabaa, aababa, abaaba, ababaa, baabab, babaab, babab, bb}.

The advantage of this definition is that we can retrieve uniquely x̃ from M_{F_x̃}.
However, the total size of M_{F_x̃} (that is, the sum of the lengths of its elements)
can be very large, as the following lemma suggests.

Lemma 2. Let x̃ be a circular word of length m > 0. The set M_{F_x̃} contains
precisely ℓ words of maximal length m + 1, where ℓ is the number of distinct rotations
of any linearization x of x̃, that is, the cardinality of {x^⟨i⟩ | i = 0, ..., |x| - 1}.

Proof. Let x = x[0]x[1]..x[m - 1] be a linearization of x̃. The word obtained by
appending to x its first letter, x[0]x[1]..x[m - 1]x[0], belongs to M_{F_x̃}, since it
has length m + 1, hence it cannot belong to F_x̃, but its maximal proper prefix
x = x^⟨0⟩ and its maximal proper suffix x^⟨1⟩ = x[1]..x[m - 1]x[0] belong to F_x̃.
The same argument shows that for any rotation x^⟨i⟩ = x[i]x[i + 1]..x[m -
1]x[0]..x[i - 1] of x, the word x[i]x[i + 1]..x[m - 1]x[0]..x[i - 1]x[i], obtained
by appending to x^⟨i⟩ its first letter, belongs to M_{F_x̃}.

Conversely, if a word of maximal length m + 1 is in M_{F_x̃}, then its maximal
proper prefix and its maximal proper suffix are words of length m in F_x̃, so they
must be consecutive rotations of x.

Therefore, the number of words of maximal length m + 1 in M_{F_x̃} equals the
number of distinct rotations of x, hence the statement follows. □
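The first definition and the count stated in Lemma 2 can be checked by brute force on small words. In the sketch below the function names are hypothetical, the alphabet is assumed to be the set of letters of x, and the circular factor set is taken to be the set of factors of xx of length at most |x|, as recalled above.

```python
def circular_factors(x):
    """Factors of the circular word: factors of xx of length at most |x|."""
    xx, m = x + x, len(x)
    return {xx[i:i + k] for i in range(m) for k in range(m + 1)}


def circular_maws_first_definition(x):
    """Words aub outside the circular factor set whose au and ub both belong to it."""
    fac = circular_factors(x)
    alphabet = sorted(set(x))
    return {a + u for u in fac for a in alphabet
            if a + u not in fac and (a + u)[:-1] in fac}


def distinct_rotations(x):
    """The set of distinct rotations of the linear word x."""
    return {x[i:] + x[:i] for i in range(len(x))}


x = "abaab"
maws = circular_maws_first_definition(x)
longest = {w for w in maws if len(w) == len(x) + 1}
print(sorted(maws))
# ['aaa', 'aabaa', 'aababa', 'abaaba', 'ababaa', 'baabab', 'babaab', 'babab', 'bb']
print(len(longest), len(distinct_rotations(x)))  # 5 5
```

For the non-primitive word abab the same check gives two words of length 5, matching its two distinct rotations.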

This is in sharp contrast with the situation for linear words, where the set 
of minimal absent words can be represented on a trie having size linear in the 
length of the word. Indeed, the algorithm MF-trie, introduced in [9], builds 
the tree-like deterministic automaton accepting the set of minimal absent words 
for a word x taking as input the factor automaton of x, that is the minimal 
deterministic automaton recognizing the set of factors of x. The leaves of the 
trie correspond to the minimal absent words for x, while the internal states are 
those of the factor automaton. Since the factor automaton of a word x has less 
than 2|x| states (for details, see [8j), this provides a representation of the minimal 
absent words of a word of length n in space 0{an). 

This algorithmic drawback leads us to the second definition. This second
definition of minimal absent words for circular strings has been already introduced
in [26]. First, we give a combinatorial result which shows that when considering
circular words it does not make sense to look at absent words obtained from
more than two rotations.

Lemma 3. For any positive integer k and any word u, the set V = {v | k|u| + 1 <
|v| < (k + 1)|u|} ∩ (M_{u^{k+1}} \ M_{u^k}) is empty.

Proof. This obviously holds for all words u of length 1. Assume towards a
contradiction that this is not the case in general. Hence, there must exist a word
v of length m that fulfills the conditions in the lemma, thus v ∈ V and m > 2.
Furthermore, since the length m - 1 prefix and the length m - 1 suffix of every
minimal absent word occur in the main word at non-consecutive positions, there
must exist positions i < j < n = |u| such that

v[1..m - 2] = u^{k+1}[i + 1..i + m - 2] = u^{k+1}[j + 1..j + m - 2]. (1)

Obviously, following Equation (1), since m - 2 ≥ kn, we have that v[1..m - 2]
is (j - i)-periodic. But we know that v[1..m - 2] is also n-periodic. Thus,
following a direct application of the periodicity lemma we have that v[1..m - 2]
is p = gcd(j - i, n)-periodic. But, in this case we have that u is p-periodic,
and, therefore, u[i] = u[j], which leads to a contradiction with the fact that
v is a minimal absent word, whenever i is defined. Thus, it must be the case
that i = -1. Using the same strategy and looking at positions u[i + m - 2] and
u[j + m - 2], we conclude that j + m - 2 = (k + 1)n. Therefore, in this case, we
have that m = kn + 1, which is a contradiction with the fact that the word v
fulfills the conditions of the lemma. This concludes the proof. □

Observe now that the set V consists in fact of all extra minimal absent words 
generated whenever we look at more than one rotation, that do not include the 
length arguments. That is, V does not include the words bounding the maximum 
length that a word is allowed, nor the words created, or lost, during a further 
concatenation of an image of u. However, when considering an iterative
concatenation of the word, these extra elements determined by the length
constraint cancel each other.

As observed in Section 2, two rotations of the same word x generate two
languages that have the same set of factors. So, we can unambiguously associate
to a circular word x̃ the (infinite) factorial language F_{x̃*}. It is therefore natural
to define the set of minimal absent words for the circular word x̃ as the set
M_{F_{x̃*}}. For instance, if x̃ = abaab, then we have

M_{F_{x̃*}} = {aaa, aabaa, babab, bb}.

This second definition is much more efficient in terms of space, as we show
below. In particular, the length of the words in M_{F_{x̃*}} is bounded from above by
|x|, hence M_{F_{x̃*}} is a finite set.

Recall that a word x is a power of a word y if there exists a positive integer
k ≥ 1 such that x is expressed as k consecutive concatenations of y, denoted
by x = y^k. Conversely, a word x is primitive if x = y^k implies k = 1. Notice
that a word is primitive if and only if each of its rotations is. We can therefore
extend the definition of primitivity to circular words. The definition of M_{F_{x̃*}}
does not allow one to uniquely reconstruct x̃ from M_{F_{x̃*}}, unless x̃ is known
to be primitive, since it is readily verified that F_{x*} = F_{(xx)*} and therefore also
the minimal absent words of these two languages coincide. However, from the
algorithmic point of view, this issue can be easily managed by storing the length
|x| of a linearization x of x̃ together with the set M_{F_{x̃*}}. Moreover, in most
practical cases, for example when dealing with biological sequences, it is highly
unlikely that the circular word considered is not primitive.
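Primitivity is easy to test in practice. The sketch below uses a classical characterisation that is not stated in this article: a non-empty word x is primitive if and only if x occurs in xx only as a prefix and as a suffix. The function names are hypothetical.

```python
def is_primitive(x):
    """True iff the first occurrence of x in xx after position 0 is at position |x|."""
    return len(x) > 0 and (x + x).find(x, 1) == len(x)


def rotations(x):
    """All rotations of x, in order of the rotation index."""
    return [x[i:] + x[:i] for i in range(len(x))]


print(is_primitive("abaab"))                             # True
print(is_primitive("abab"))                              # False, since abab = (ab)^2
print(all(is_primitive(r) for r in rotations("abaab")))  # True
```

The last line illustrates that primitivity does not depend on the chosen linearization of a circular word.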

The difference between the two definitions above is presented in the next
lemma.

Lemma 4. M_{F_{x̃*}} = M_{F_x̃} ∩ Σ^{≤|x|}.

Proof. Clearly, F_{x̃*} ∩ Σ^{≤|x|} = F_x̃. The statement then follows from the definition
of minimal absent words. □

Based on the previous discussion, we set M_x̃ = M_{F_{x̃*}}, while the following
corollary comes straightforwardly as a consequence of Lemma 3.

Corollary 5. Let x̃ be a circular word. Then M_x̃ = M_{xx}^{|x|}, that is, the set
of minimal absent words of xx of length at most |x|.

Corollary 5 was first introduced as a definition for the set of minimal absent
words of a circular word in [26]. Using the result of Corollary 5, we can easily
extend the algorithm described in the previous section to the case of circular
words. That is, given two circular words x̃ of length m and ỹ of length n, we can
compute in time and space O(m + n) the quantity LW(M_x̃, M_ỹ). We obtain the
following result.

Theorem 6. Problem MAW-CircularSequenceComparison can be solved
in time and space O(m + n).
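The reduction of the circular case to the linear one can be illustrated by brute force. In this sketch the minimal absent words of a circular word are obtained, following Corollary 5, as the minimal absent words of xx of length at most |x|; the function names are hypothetical, the alphabet is assumed to be the set of letters of the input, and the weight 1/|w|^2 used for the index is an assumption of the sketch.

```python
from fractions import Fraction


def maws(y):
    """Brute-force minimal absent words of the linear word y."""
    fac = {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}
    return {a + u for u in fac for a in set(y)
            if a + u not in fac and (a + u)[:-1] in fac}


def circular_maws(x):
    """Minimal absent words of xx of length at most |x| (Corollary 5)."""
    return {w for w in maws(x + x) if len(w) <= len(x)}


def lw(set_x, set_y):
    """Length-weighted index; the weight 1/|w|^2 is an assumption of this sketch."""
    return sum((Fraction(1, len(w) ** 2) for w in set_x ^ set_y), Fraction(0))


x, rotated = "abaab", "baaba"                        # rotated is a rotation of x
print(sorted(circular_maws(x)))                      # ['aaa', 'aabaa', 'babab', 'bb']
print(lw(maws(x), maws(rotated)))                    # 1/8
print(lw(circular_maws(x), circular_maws(rotated)))  # 0
```

The linear measure distinguishes the two rotations (abaa versus aaba), while the circular measure is 0, as expected for two linearizations of the same circular word.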

5 Implementation and Applications 

We implemented the presented algorithms as programme scMAW to perform
pairwise sequence comparison for a set of sequences using minimal absent words.
scMAW uses programme MAW [2] for linear-time and linear-space computation
of minimal absent words using suffix array. scMAW was implemented in the C
programming language and developed under GNU/Linux operating system. It
takes, as input argument, a file in MultiFASTA format with the input sequences,
and then any of the two methods, for linear or circular sequence comparison, can
be applied. It then produces a file in PHYLIP format with the distance matrix
as output. Cell [x, y] of the matrix stores LW(M_x, M_y) (or LW(M_x̃, M_ỹ) for the
circular case). The implementation is distributed under the GNU General Public
License (GPL), and it is available at http://github.com/solonas13/maw,
which is set up for maintaining the source code and the man-page documentation.
Notice that all input datasets and the produced outputs referred to in this
section are publicly maintained at the same web-site.

An important feature of the proposed algorithms is that they require space
linear in the length of the sequences (see Theorem 1 and Theorem 6). Hence, we
were also able to implement scMAW using the Open Multi-Processing (OpenMP)
API for shared memory multiprocessing programming to distribute the workload
across the available processing threads without a large memory footprint.

Application. Recently, there has been a number of studies on the biological
significance of absent words in various species [1, 16, 31]. In [16], the authors
presented dendrograms from dinucleotide relative abundances in sets of minimal
absent words for prokaryotes and eukaryotic genomes. The analyses support the
hypothesis that minimal absent words are inherited through a common ancestor,
in addition to lineage-specific inheritance, only in vertebrates. Very recently,
in [31], it was shown that there exist three minimal words in the Ebola virus
genomes which are absent from human genome. The authors suggest that the
identification of such species-specific sequences may prove to be useful for the
development of both diagnosis and therapeutics.

In this section, we show a potential application of our results for the
construction of dendrograms for DNA sequences with circular structure. Circular DNA
sequences can be found in viruses, as plasmids in archaea and bacteria, and in
the mitochondria and plastids of eukaryotic cells. Circular sequence comparison
thus finds applications in several contexts such as reconstructing phylogenies
using viroids RNA [M] or Mitochondrial DNA (MtDNA) [17]. Conventional tools
to align circular sequences could yield an incorrectly high genetic distance
between closely-related species. Indeed, when sequencing molecules, the position
where a circular sequence starts can be totally arbitrary. Due to this
arbitrariness, a suitable rotation of one sequence would give much better results for a
pairwise alignment [4, 18]. In what follows, we demonstrate the power of minimal
absent words to pave a path to resolve this issue by applying Corollary 5 and
Theorem 6. We do not claim that a solid phylogenetic analysis is presented
but rather an investigation for potential applications of our theoretical findings.

We performed the following experiment with synthetic data. First, we simulated
a basic dataset of DNA sequences using INDELible [14]. The number of
taxa, denoted by α, was set to 12; the length of the sequence generated at the
root of the tree, denoted by β, was set to 2500bp; and the substitution rate,
denoted by γ, was set to 0.05. We also used the following parameters: a deletion
rate, denoted by δ, of 0.06 relative to substitution rate of 1; and an insertion
rate, denoted by ε, of 0.04 relative to substitution rate of 1. The parameters
were chosen based on the genetic diversity standard measures observed for sets
of MtDNA sequences from primates and mammals [4]. We generated another
instance of the basic dataset, containing one arbitrary rotation of each of the
α sequences from the basic dataset. We then used this randomized dataset as
input to scMAW by considering LW(M_x̃, M_ỹ) as the distance metric. The output
of scMAW was passed as input to NINJA [33], an efficient implementation
of neighbor-joining [30], a well-established hierarchical clustering algorithm for
inferring dendrograms (trees). We thus used NINJA to infer the respective tree
T1 under the neighbor-joining criterion. We also inferred the tree T2 by following
the same pipeline, but by considering LW(M_x, M_y) as distance metric, as well as
the tree T3 by using the basic dataset as input of this pipeline and LW(M_x̃, M_ỹ)
as distance metric. Hence, notice that T3 represents the original tree. Finally, we
computed the pairwise Robinson-Foulds (RF) distance [29] between: T1 and T3;
and T2 and T3.

Dataset <α, β, γ, δ, ε>            T1 vs. T3    T2 vs. T3
<12, 2500, 0.05, 0.06, 0.04>       100%         100%
<12, 2500, 0.20, 0.06, 0.04>       100%         88.88%
<12, 2500, 0.35, 0.06, 0.04>       100%         100%
<25, 2500, 0.05, 0.06, 0.04>       100%         100%
<25, 2500, 0.20, 0.06, 0.04>       100%         100%
<25, 2500, 0.35, 0.06, 0.04>       100%         100%
<50, 2500, 0.05, 0.06, 0.04>       100%         97.87%
<50, 2500, 0.20, 0.06, 0.04>       100%         97.87%
<50, 2500, 0.35, 0.06, 0.04>       100%         100%

Table 1. Accuracy measurements based on relative pairwise RF distance

Let us define accuracy as the difference between 1 and the relative pairwise
RF distance. We repeated this experiment by simulating different datasets
<α, β, γ, δ, ε> and measured the corresponding accuracy. The results in Table 1
(see T1 vs. T3) suggest that by considering LW(M_x̃, M_ỹ) we can always
reconstruct the original tree even if the sequences have first been arbitrarily rotated
(Corollary 5). This is not the case (see T2 vs. T3) if we consider LW(M_x, M_y).
Notice that 100% accuracy denotes a (relative) pairwise RF distance of 0.

6 Final Remarks 

In this article, complementary to measures that refer to the composition of
sequences in terms of their constituent patterns, we considered sequence
comparison using minimal absent words, information about what does not occur in the
sequences. We presented the first linear-time and linear-space algorithm to
compare two sequences by considering all their minimal absent words (Theorem 1).
In the process, we presented some results of combinatorial interest, and also
extended the proposed techniques to circular sequences. The power of minimal
absent words is highlighted by the fact that they provide a tool for sequence
comparison that is as efficient for circular as it is for linear sequences (Corollary 5
and Theorem 6); whereas, this is not the case, for instance, using the general
edit distance model m- Finally, a preliminary experimental study shows the 
potential of our theoretical findings. 

Our immediate target is to consider the following incremental version of the 
same problem: given an appropriate encoding of a comparison between sequences 
x and y, can one incrementally compute the answer for x and ay, and the answer
for x and ya, efficiently, where a is an additional letter? Incremental sequence
comparison, under the edit distance model, has already been considered in [20].

In [18], the authors considered a more powerful generalization of the q-gram
distance (see [32] for definition) to compare x and y. This generalization
comprises partitioning x and y in β blocks each, as evenly as possible, computing
the q-gram distance between the corresponding block pairs, and then summing
up the distances computed blockwise to obtain the new measure. We are also
planning to apply this generalization to the similarity measure studied here and
evaluate it using real and synthetic data.